Water quality has always been an important subject of research for human life. This project is about water quality for human consumption: we will try to classify water samples as potable or not potable. This is my first project as a beginner Data Scientist, and I hope it turns out well.
ph: pH of water (0 to 14).
Hardness: Capacity of water to precipitate soap in mg/L.
Solids: Total dissolved solids in ppm.
Chloramines: Amount of Chloramines in ppm.
Sulfate: Amount of Sulfates dissolved in mg/L.
Conductivity: Electrical conductivity of water in μS/cm.
Organic_carbon: Amount of organic carbon in ppm.
Trihalomethanes: Amount of Trihalomethanes in μg/L.
Turbidity: Measure of the light-scattering property of water in NTU.
Potability: Indicates whether the water is safe for human consumption. Potable = 1, Not potable = 0.
# Default Libraries
import numpy as np # linear algebra
import pandas as pd # data processing
# Visualization Libraries
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import seaborn as sbn
import missingno as msgn
# Model Libraries
from sklearn.linear_model import LogisticRegression,RidgeClassifier,SGDClassifier,PassiveAggressiveClassifier
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC,LinearSVC,NuSVC
from sklearn.neighbors import KNeighborsClassifier,NearestCentroid
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB,BernoulliNB
from sklearn.ensemble import VotingClassifier
# Data Pre-processing Libraries
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report , accuracy_score , confusion_matrix,precision_score
from sklearn.impute import SimpleImputer, KNNImputer
# Suppress Warnings
import warnings
warnings.filterwarnings('ignore')
water_quality = pd.read_csv("water_potability.csv")
dataset = water_quality.copy()
# Dataset First Five Data - head()
dataset.head()
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.990970 | 2.963135 | 0 |
| 1 | 3.716080 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.546600 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
# Dataset Info
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3276 entries, 0 to 3275
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   ph               2785 non-null   float64
 1   Hardness         3276 non-null   float64
 2   Solids           3276 non-null   float64
 3   Chloramines      3276 non-null   float64
 4   Sulfate          2495 non-null   float64
 5   Conductivity     3276 non-null   float64
 6   Organic_carbon   3276 non-null   float64
 7   Trihalomethanes  3114 non-null   float64
 8   Turbidity        3276 non-null   float64
 9   Potability       3276 non-null   int64
dtypes: float64(9), int64(1)
memory usage: 256.1 KB
# Descriptive Statistics
dataset.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ph | 2785.0 | 7.080795 | 1.594320 | 0.000000 | 6.093092 | 7.036752 | 8.062066 | 14.000000 |
| Hardness | 3276.0 | 196.369496 | 32.879761 | 47.432000 | 176.850538 | 196.967627 | 216.667456 | 323.124000 |
| Solids | 3276.0 | 22014.092526 | 8768.570828 | 320.942611 | 15666.690297 | 20927.833607 | 27332.762127 | 61227.196008 |
| Chloramines | 3276.0 | 7.122277 | 1.583085 | 0.352000 | 6.127421 | 7.130299 | 8.114887 | 13.127000 |
| Sulfate | 2495.0 | 333.775777 | 41.416840 | 129.000000 | 307.699498 | 333.073546 | 359.950170 | 481.030642 |
| Conductivity | 3276.0 | 426.205111 | 80.824064 | 181.483754 | 365.734414 | 421.884968 | 481.792304 | 753.342620 |
| Organic_carbon | 3276.0 | 14.284970 | 3.308162 | 2.200000 | 12.065801 | 14.218338 | 16.557652 | 28.300000 |
| Trihalomethanes | 3114.0 | 66.396293 | 16.175008 | 0.738000 | 55.844536 | 66.622485 | 77.337473 | 124.000000 |
| Turbidity | 3276.0 | 3.966786 | 0.780382 | 1.450000 | 3.439711 | 3.955028 | 4.500320 | 6.739000 |
| Potability | 3276.0 | 0.390110 | 0.487849 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
The table above shows the range of each variable.
# Dataset Corr
dataset.corr()
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| ph | 1.000000 | 0.082096 | -0.089288 | -0.034350 | 0.018203 | 0.018614 | 0.043503 | 0.003354 | -0.039057 | -0.003556 |
| Hardness | 0.082096 | 1.000000 | -0.046899 | -0.030054 | -0.106923 | -0.023915 | 0.003610 | -0.013013 | -0.014449 | -0.013837 |
| Solids | -0.089288 | -0.046899 | 1.000000 | -0.070148 | -0.171804 | 0.013831 | 0.010242 | -0.009143 | 0.019546 | 0.033743 |
| Chloramines | -0.034350 | -0.030054 | -0.070148 | 1.000000 | 0.027244 | -0.020486 | -0.012653 | 0.017084 | 0.002363 | 0.023779 |
| Sulfate | 0.018203 | -0.106923 | -0.171804 | 0.027244 | 1.000000 | -0.016121 | 0.030831 | -0.030274 | -0.011187 | -0.023577 |
| Conductivity | 0.018614 | -0.023915 | 0.013831 | -0.020486 | -0.016121 | 1.000000 | 0.020966 | 0.001285 | 0.005798 | -0.008128 |
| Organic_carbon | 0.043503 | 0.003610 | 0.010242 | -0.012653 | 0.030831 | 0.020966 | 1.000000 | -0.013274 | -0.027308 | -0.030001 |
| Trihalomethanes | 0.003354 | -0.013013 | -0.009143 | 0.017084 | -0.030274 | 0.001285 | -0.013274 | 1.000000 | -0.022145 | 0.007130 |
| Turbidity | -0.039057 | -0.014449 | 0.019546 | 0.002363 | -0.011187 | 0.005798 | -0.027308 | -0.022145 | 1.000000 | 0.001581 |
| Potability | -0.003556 | -0.013837 | 0.033743 | 0.023779 | -0.023577 | -0.008128 | -0.030001 | 0.007130 | 0.001581 | 1.000000 |
Correlation: the effect of each variable on the dependent variable. No feature shows a strong linear correlation with Potability.
# Potability 0 - 1
dataset.Potability.value_counts().plot(kind ='pie');
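Beyond the pie chart, the class balance can also be printed directly with `value_counts`; a minimal sketch on a hypothetical mini-sample standing in for the Potability column:

```python
import pandas as pd

# Hypothetical mini-sample standing in for dataset['Potability']
potability = pd.Series([0, 0, 0, 1, 1])

counts = potability.value_counts()                 # absolute counts per class
ratios = potability.value_counts(normalize=True)   # class proportions

print(counts.to_dict())   # {0: 3, 1: 2}
print(ratios.to_dict())   # {0: 0.6, 1: 0.4}
```

On the real dataset the same calls show the classes are imbalanced (about 61% not potable vs. 39% potable), which is worth keeping in mind when reading accuracy-style metrics.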
# I want to choose quality water and poor quality water
quality_water = dataset[dataset['Potability'] == 1]
poorquality_water = dataset[dataset['Potability'] == 0]
# ph range
ph07 = dataset[dataset['ph'] < 7]
ph714 = dataset[dataset['ph'] >= 7]
# labels - values
labels = ['ph(0-7) - Potability 1', 'ph(7-14) - Potability 1', 'ph(0-7) - Potability 0', 'ph(7-14) - Potability 0']
values = [len(quality_water[quality_water['ph'] < 7]), len(quality_water[quality_water['ph'] >= 7]),
          len(poorquality_water[poorquality_water['ph'] < 7]), len(poorquality_water[poorquality_water['ph'] >= 7])]
fig = go.Figure(data=[go.Pie(labels=labels , values=values , hole=.4)])
fig.show()
By splitting the pH at 7 into smaller and larger values, we examine its effect on potable and non-potable water.
This visualization is only a rough overview, not a precise result.
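The same counts can be obtained more directly with `pd.cut` and `pd.crosstab`, which avoids filtering a subset with a mask built on the full DataFrame; a hedged sketch on a hypothetical mini-sample:

```python
import pandas as pd

# Hypothetical mini-sample standing in for the real dataset
df = pd.DataFrame({
    "ph": [5.0, 6.5, 7.2, 8.1, 9.0, 6.0],
    "Potability": [0, 1, 1, 0, 1, 0],
})

# Band pH below/above 7, then cross-tabulate against Potability
df["ph_band"] = pd.cut(df["ph"], bins=[0, 7, 14], labels=["ph < 7", "ph >= 7"])
table = pd.crosstab(df["ph_band"], df["Potability"])
print(table)
```

Each cell of `table` is the number of samples in that pH band with that Potability label, so the four pie-chart values fall out of one call.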
sbn.scatterplot(x=dataset["ph"], y=dataset["Hardness"], hue=dataset.Potability,
data=dataset);
# Chloramines
sbn.scatterplot(x=dataset["ph"], y=dataset["Chloramines"], hue=dataset.Potability,
data=dataset);
# organic-carbon
sbn.scatterplot(x=dataset["ph"], y=dataset["Organic_carbon"], hue=dataset.Potability,
data=dataset);
sbn.catplot(x="Potability", y='ph', data=dataset, kind="box");
sbn.catplot(x="Potability", y='Hardness', data=dataset, kind="box");
sbn.catplot(x="Potability", y='Chloramines', data=dataset, kind="box");
# organic-carbon
sbn.catplot(x="Potability", y='Organic_carbon', data=dataset, kind="box");
dataset.drop('Potability', axis=1).hist(figsize=(12,8));
# Corr Graph
plt.figure(figsize = (15,9))
sbn.heatmap(dataset.corr(), annot = True);
Next we analyze the missing data; the gaps will then be filled using a machine-learning-based imputer.
# Missing Data
dataset.isnull().sum()
ph                 491
Hardness             0
Solids               0
Chloramines          0
Sulfate            781
Conductivity         0
Organic_carbon       0
Trihalomethanes    162
Turbidity            0
Potability           0
dtype: int64
Missing data can be divided into values that are missing for a reason and values that are missing at random.
Systematic gaps exist because of some underlying cause; random gaps carry no such pattern.
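One quick way to probe whether a column is missing at random is to compare the target rate for rows with and without the value; a hedged sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample: ph is missing in two rows
df = pd.DataFrame({
    "ph": [7.1, np.nan, 6.8, np.nan, 7.5, 6.9],
    "Potability": [1, 0, 1, 0, 1, 0],
})

# Mean of the target, split by whether ph is missing
rates = df.groupby(df["ph"].isnull())["Potability"].mean()
print(rates.to_dict())  # {False: 0.75, True: 0.0}
```

If the two rates differ sharply on real data, the missingness may be systematic rather than random, which argues for a more careful imputation strategy.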
dataset.isnull().mean()*100
ph                 14.987790
Hardness            0.000000
Solids              0.000000
Chloramines         0.000000
Sulfate            23.840049
Conductivity        0.000000
Organic_carbon      0.000000
Trihalomethanes     4.945055
Turbidity           0.000000
Potability          0.000000
dtype: float64
As we can see, Sulfate (~24%), ph (~15%) and Trihalomethanes (~5%) contain missing values.
# Missing Data Count
msgn.bar(dataset , figsize=(10,5));
# Matrix Missing Data Set
msgn.matrix(dataset, figsize = (10 ,5));
var_names = list(dataset.columns)
numpy_dataset = np.array(dataset)
# KNN Imputer
knn_imputer = KNNImputer()
# Fill Missing Value
df = knn_imputer.fit_transform(numpy_dataset)
# DataFrame
df = pd.DataFrame(df , columns=var_names)
df.isnull().sum()
ph 0 Hardness 0 Solids 0 Chloramines 0 Sulfate 0 Conductivity 0 Organic_carbon 0 Trihalomethanes 0 Turbidity 0 Potability 0 dtype: int64
By using KNNImputer, we have filled the missing values in the dataset.
I will keep going with "df".
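For reference, `KNNImputer` fills each missing entry with the average of that feature over the `n_neighbors` nearest rows, measured by (nan-aware) Euclidean distance on the observed features; a minimal sketch on toy data:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with one missing value
X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [1.0, 6.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[1, 0])  # mean of column 0 over the two nearest rows -> 1.0
```

Note that these distances are scale-sensitive: a wide-range feature like Solids (hundreds to tens of thousands of ppm) will dominate the neighbor search, so scaling before imputation may be worth trying.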
# Select label and features
X = df.drop(columns=['Potability'] , axis = 1)
y = df['Potability']
# Train and Test Split
(x_train , x_test , y_train , y_test) = train_test_split(X, y , test_size=0.20 , random_state=99)
# Shape
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
(2620, 9)
(2620,)
(656, 9)
(656,)
# scaler
scaler = StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
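Note that the scaler is fit on the training split only and merely applied to the test split, which avoids leaking test statistics. Imputation and scaling can be chained the same leakage-safe way inside a `Pipeline`, so each step is fit on training data only; a sketch with assumed synthetic data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Assumed synthetic stand-in for the raw feature matrix
rng = np.random.RandomState(0)
X = rng.rand(40, 3)
X[::7, 0] = np.nan                      # sprinkle missing values into one feature
y = (rng.rand(40) > 0.5).astype(int)

pipe = Pipeline([
    ("impute", KNNImputer(n_neighbors=3)),   # fill gaps from nearest rows
    ("scale", StandardScaler()),             # standardize after imputation
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
preds = pipe.predict(X)
print(preds.shape)  # (40,)
```

With a pipeline, `fit` on the training fold and `predict` on the test fold apply every preprocessing step consistently and without manual bookkeeping.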
models =[("LR", LogisticRegression(max_iter=1000)),("SVC", SVC()),('KNN',KNeighborsClassifier(n_neighbors=10)),
("DTC", DecisionTreeClassifier()),("GNB", GaussianNB()),
("SGDC", SGDClassifier()),("Perc", Perceptron()),("NC",NearestCentroid()),
("Ridge", RidgeClassifier()),("NuSVC", NuSVC()),("BNB", BernoulliNB()),
('RF',RandomForestClassifier()),('ADA',AdaBoostClassifier()),
('XGB',GradientBoostingClassifier()),('PAC',PassiveAggressiveClassifier())]
results = []
names = []
finalResults = []
for name, model in models:
    model.fit(x_train, y_train)
    model_results = model.predict(x_test)
    score = precision_score(y_test, model_results, average='macro')
    results.append(score)
    names.append(name)
    finalResults.append((name, score))
finalResults.sort(key=lambda k: k[1], reverse=True)
# All models
all_models = pd.DataFrame(finalResults)
all_models
| 0 | 1 | |
|---|---|---|
| 0 | SVC | 0.680651 |
| 1 | KNN | 0.630238 |
| 2 | RF | 0.626869 |
| 3 | NuSVC | 0.609746 |
| 4 | XGB | 0.609443 |
| 5 | GNB | 0.596875 |
| 6 | ADA | 0.594315 |
| 7 | DTC | 0.578933 |
| 8 | LR | 0.562691 |
| 9 | Ridge | 0.562691 |
| 10 | SGDC | 0.516238 |
| 11 | Perc | 0.501906 |
| 12 | NC | 0.501146 |
| 13 | PAC | 0.476107 |
| 14 | BNB | 0.312500 |
SVC and KNN achieved the best macro precision scores (about 0.68 and 0.63), so they are the most promising models to tune further.
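Since `GridSearchCV` is already imported, a natural next step is to tune the best model's hyperparameters with cross-validation. A hedged sketch, using synthetic data as a stand-in for the scaled training split (the grid values here are just illustrative starting points):

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Synthetic stand-in for the scaled training data (same shape: 9 features)
X, y = make_classification(n_samples=200, n_features=9, random_state=99)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}
grid = GridSearchCV(SVC(), param_grid, cv=3, scoring="precision_macro")
grid.fit(X, y)

print(grid.best_params_)          # best (C, gamma) combination found
print(round(grid.best_score_, 3)) # mean cross-validated macro precision
```

On the real `x_train`/`y_train`, the same call would refit the best SVC on the full training split, ready for a final evaluation on `x_test`.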